Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
Attaching package: ‘Hmisc’
The following objects are masked from ‘package:base’:
format.pval, units
Attaching package: ‘dplyr’
The following objects are masked from ‘package:Hmisc’:
src, summarize
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
======================
Welcome to d3heatmap version 0.9.0
Type citation('d3heatmap') for how to cite the package.
Type ?d3heatmap for the main documentation.
The github page is: https://github.com/talgalili/d3heatmap/
Please submit your suggestions and bug-reports at: https://github.com/talgalili/d3heatmap/issues
You may ask questions at stackoverflow, use the r and d3heatmap tags:
https://stackoverflow.com/questions/tagged/d3heatmap
======================
Attaching package: ‘d3heatmap’
The following objects are masked from ‘package:base’:
print, save
Attaching package: ‘lubridate’
The following objects are masked from ‘package:base’:
date, intersect, setdiff, union
The purpose of this notebook is to give data locations, data ingestion code, and code for rudimentary analysis and visualization of the COVID-19 data provided by The New York Times, [NYT1].
The following steps are taken:
1. Ingest the data:
   - COVID-19 data from The New York Times, based on reports from state and local health agencies, [NYT1].
   - USA county records data (FIPS codes, geo-coordinates, populations), [WRI1].
2. Merge the data.
3. Make data summaries and related plots.
4. Make corresponding geo-plots.
5. Do an "out of the box" time series forecast.
6. Analyze the fluctuations around the time series trends.
Note that other, older repositories with COVID-19 data exist, e.g. [JH1, VK1].
Remark: The time series section is done for illustration purposes only. The forecasts there should not be taken seriously.
From the help page of tolower:
capwords <- function(s, strict = FALSE) {
  cap <- function(s) paste(toupper(substring(s, 1, 1)),
                           {s <- substring(s, 2); if(strict) tolower(s) else s},
                           sep = "", collapse = " " )
  sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}
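A quick sanity check of the function (repeated here so the snippet runs on its own):

```r
# capwords: capitalize the first letter of every space-separated word;
# with strict = TRUE the remaining characters are lower-cased.
capwords <- function(s, strict = FALSE) {
  cap <- function(s) paste(toupper(substring(s, 1, 1)),
                           {s <- substring(s, 2); if(strict) tolower(s) else s},
                           sep = "", collapse = " ")
  sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}

capwords(c("date", "cases"))        # "Date" "Cases"
capwords("NEW YORK", strict = TRUE) # "New York"
```

This is what turns the lower-case CSV column names (date, state, fips, ...) into Date, State, Fips, etc.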
if( !exists("dfNYDataStates") ) {
  dfNYDataStates <- read.csv( "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv",
                              colClasses = c("character", "character", "character", "integer", "integer"),
                              stringsAsFactors = FALSE )
  colnames(dfNYDataStates) <- capwords(colnames(dfNYDataStates))
}
head(dfNYDataStates)
dfNYDataStates$DateObject <- as.POSIXct(dfNYDataStates$Date)
summary(as.data.frame(unclass(dfNYDataStates), stringsAsFactors = TRUE))
Date State Fips Cases Deaths DateObject
2021-09-22: 56 Washington : 653 53 : 653 Min. : 1 Min. : 0 Min. :2020-01-21 00:00:00
2021-09-23: 56 Illinois : 650 17 : 650 1st Qu.: 19940 1st Qu.: 420 1st Qu.:2020-08-02 00:00:00
2021-09-24: 56 California : 649 06 : 649 Median : 124641 Median : 2288 Median :2021-01-02 00:00:00
2021-09-25: 56 Arizona : 648 04 : 648 Mean : 359504 Mean : 6678 Mean :2021-01-02 01:00:00
2021-09-26: 56 Massachusetts: 642 25 : 642 3rd Qu.: 453738 3rd Qu.: 7937 3rd Qu.:2021-06-04 00:00:00
2021-09-27: 56 Wisconsin : 638 55 : 638 Max. :4943059 Max. :72471 Max. :2021-11-03 00:00:00
(Other) :33326 (Other) :29782 (Other):29782
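The as.data.frame(unclass(...), stringsAsFactors = TRUE) idiom used above converts the character columns to factors, so that summary tabulates level counts instead of just reporting the column class. A minimal sketch on toy data:

```r
df <- data.frame(State = c("WA", "WA", "IL"), Cases = 1:3,
                 stringsAsFactors = FALSE)
summary(df$State)   # character column: summary only gives length/class/mode

# After the unclass + stringsAsFactors idiom the column is a factor,
# so summary tabulates the level counts (IL: 1, WA: 2):
dfF <- as.data.frame(unclass(df), stringsAsFactors = TRUE)
summary(dfF$State)
```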
Summary by state:
by( data = as.data.frame(unclass(dfNYDataStates)), INDICES = dfNYDataStates$State, FUN = summary )
Alternative summary:
Hmisc::describe(dfNYDataStates)
if( !exists("dfNYDataCounties") ) {
  dfNYDataCounties <- read.csv( "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv",
                                colClasses = c("character", "character", "character", "character", "integer", "integer"),
                                stringsAsFactors = FALSE )
  colnames(dfNYDataCounties) <- capwords(colnames(dfNYDataCounties))
}
head(dfNYDataCounties)
dfNYDataCounties$DateObject <- as.POSIXct(dfNYDataCounties$Date)
summary(as.data.frame(unclass(dfNYDataCounties), stringsAsFactors = TRUE))
Date County State Fips Cases Deaths DateObject
2021-09-03: 3251 Washington: 18294 Texas : 144980 : 17199 Min. : 0 Min. : 0.0 Min. :2020-01-21 00:00:00
2021-10-11: 3251 Unknown : 15495 Georgia : 94329 53061 : 653 1st Qu.: 207 1st Qu.: 3.0 1st Qu.:2020-08-26 00:00:00
2021-04-05: 3250 Jefferson : 15325 Virginia: 77719 17031 : 650 Median : 1116 Median : 21.0 Median :2021-01-18 00:00:00
2021-08-03: 3250 Franklin : 14682 Kentucky: 69652 06059 : 649 Mean : 6432 Mean : 122.2 Mean :2021-01-16 23:56:00
2021-08-04: 3250 Jackson : 13983 Missouri: 67253 04013 : 648 3rd Qu.: 3784 3rd Qu.: 72.0 3rd Qu.:2021-06-12 00:00:00
2021-08-10: 3250 Lincoln : 13945 Illinois: 59829 06037 : 648 Max. :1497297 Max. :34601.0 Max. :2021-11-03 00:00:00
(Other) :1862094 (Other) :1789872 (Other) :1367834 (Other):1861149 NA's :42707
if( !exists("dfUSACountyData") ) {
  dfUSACountyData <- read.csv( "https://raw.githubusercontent.com/antononcube/SystemModeling/master/Data/dfUSACountyRecords.csv",
                               colClasses = c("character", "character", "character", "character", "integer", "numeric", "numeric"),
                               stringsAsFactors = FALSE )
}
head(dfUSACountyData)
summary(as.data.frame(unclass(dfUSACountyData), stringsAsFactors = TRUE))
Country State County FIPS Population Lat Lon
UnitedStates:3143 Texas : 254 WashingtonCounty: 30 01001 : 1 Min. : 89 Min. :19.60 Min. :-166.90
Georgia : 159 JeffersonCounty : 25 01003 : 1 1st Qu.: 10980 1st Qu.:34.70 1st Qu.: -98.23
Virginia: 134 FranklinCounty : 24 01005 : 1 Median : 25690 Median :38.37 Median : -90.40
Kentucky: 120 JacksonCounty : 23 01007 : 1 Mean : 102248 Mean :38.46 Mean : -92.28
Missouri: 115 LincolnCounty : 23 01009 : 1 3rd Qu.: 67507 3rd Qu.:41.81 3rd Qu.: -83.43
Kansas : 105 MadisonCounty : 19 01011 : 1 Max. :10170292 Max. :69.30 Max. : -67.63
(Other) :2256 (Other) :2999 (Other):3137
dsNYDataCountiesExtended <-
  dfNYDataCounties %>%
  dplyr::inner_join( dfUSACountyData %>% dplyr::select_at( .vars = c("FIPS", "Lat", "Lon", "Population") ), by = c( "Fips" = "FIPS" ) )
dsNYDataCountiesExtended
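The dplyr::inner_join above keeps only county rows whose Fips code matches a FIPS key in the county-records data. In base R the same join can be sketched with merge (toy, made-up rows):

```r
counts <- data.frame(Fips = c("53061", "99999"), Cases = c(10L, 5L))
geo    <- data.frame(FIPS = "53061", Lat = 47.8, Lon = -121.7,
                     Population = 800000L)

# Inner join on the FIPS key; the unmatched "99999" row is dropped.
merged <- merge(counts, geo, by.x = "Fips", by.y = "FIPS")
nrow(merged)  # 1
```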
ParetoPlotForColumns( as.data.frame(lapply(dsNYDataCountiesExtended[,c("Cases", "Deaths")], as.numeric)), c("Cases", "Deaths"), scales = "free" )
Note that in the plots in this sub-section we filter out Hawaii and Alaska.
ggplot2::ggplot(dsNYDataCountiesExtended[ dsNYDataCountiesExtended$Lon > -130, c("Lat", "Lon", "Cases")]) +
  ggplot2::geom_point( ggplot2::aes(x = Lon, y = Lat, fill = log10(Cases)), alpha = 0.01, size = 0.5, color = "blue" ) +
  ggplot2::coord_quickmap()
The most recent versions of leaflet in RStudio have problems rendering the visualization below.
cf <- colorBin( palette = "Reds", domain = log10(dsNYDataCountiesExtended$Cases), bins = 10 )
m <-
  leaflet( dsNYDataCountiesExtended[, c("Lat", "Lon", "Cases")] ) %>%
  addTiles() %>%
  addCircleMarkers( ~Lon, ~Lat, radius = ~ log10(Cases), fillColor = ~ cf(log10(Cases)), color = ~ cf(log10(Cases)), fillOpacity = 0.8, stroke = FALSE, popup = ~Cases )
m
dsNYDataCountiesExtended
An alternative to the geo-visualization is to use a heat-map plot.
Make a heat-map plot by sorting the rows of the cross-tabulation matrix (that correspond to states):
matSDC <- xtabs( Cases ~ State + Date, dfNYDataStates, sparse = TRUE)
d3heatmap::d3heatmap( log10(matSDC+1), cellnote = as.matrix(matSDC), scale = "none", dendrogram = "row", colors = "Blues") #, theme = "dark")
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette RdYlBu is 11
Returning the palette you asked for with that many colors
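The xtabs call above builds a states-by-dates contingency matrix whose cells hold the (cumulative) case counts; sparse = TRUE returns a sparse Matrix object to keep it memory-friendly. A minimal dense illustration:

```r
df <- data.frame(State = c("WA", "WA", "IL"),
                 Date  = c("2020-03-01", "2020-03-02", "2020-03-02"),
                 Cases = c(1L, 2L, 4L))

# Rows are states, columns are dates; absent State/Date pairs become 0.
mat <- xtabs(Cases ~ State + Date, df)
mat["WA", "2020-03-02"]  # 2
mat["IL", "2020-03-01"]  # 0
```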
Cross-tabulate states with dates over deaths and plot:
matSDD <- xtabs( Deaths ~ State + Date, dfNYDataStates, sparse = TRUE)
d3heatmap::d3heatmap( log10(matSDD+1), cellnote = as.matrix(matSDD), scale = "none", dendrogram = "row", colors = "Blues") #, theme = "dark")
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette RdYlBu is 11
Returning the palette you asked for with that many colors
In this section we do simple “forecasting” (not a serious attempt).
Make a time series data frame in long form:
dfQuery <-
  dfNYDataStates %>%
  dplyr::group_by( Date, DateObject ) %>%
  dplyr::summarise_at( .vars = c("Cases", "Deaths"), .funs = sum )
dfQueryLongForm <- tidyr::pivot_longer( dfQuery, cols = c("Cases", "Deaths"), names_to = "Variable", values_to = "Value")
head(dfQueryLongForm)
Plot the time series:
ggplot(dfQueryLongForm) +
  geom_line( aes( x = DateObject, y = log10(Value) ) ) +
  facet_wrap( ~Variable, ncol = 1 )
Fit using ARIMA:
fit <- forecast::auto.arima( dfQuery$Cases )
fit
Series: dfQuery$Cases
ARIMA(2,2,2)
Coefficients:
ar1 ar2 ma1 ma2
0.8073 -0.452 -1.6198 0.8419
s.e. 0.0406 0.047 0.0221 0.0251
sigma^2 estimated as 725101702: log likelihood=-7563.74
AIC=15137.49 AICc=15137.58 BIC=15159.88
Plot “forecast”:
plot( forecast::forecast(fit, h = 20) )
grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted")
Fit with ARIMA:
fit <- forecast::auto.arima( dfQuery$Deaths )
fit
Series: dfQuery$Deaths
ARIMA(3,2,2)
Coefficients:
ar1 ar2 ar3 ma1 ma2
0.9694 -0.6119 -0.1969 -1.4248 0.6841
s.e. 0.0495 0.0545 0.0477 0.0368 0.0302
sigma^2 estimated as 171619: log likelihood=-4845.9
AIC=9703.81 AICc=9703.94 BIC=9730.68
Plot “forecast”:
plot( forecast::forecast(fit, h = 20) )
grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted")
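The same auto.arima-then-forecast pattern works on any univariate series; here is a self-contained sketch on the built-in AirPassengers data, assuming the forecast package is installed:

```r
library(forecast)

# Automatic ARIMA order selection, then a 12-step-ahead forecast.
fit <- auto.arima(AirPassengers)
fc  <- forecast(fit, h = 12)

# fc$mean holds the point forecasts; fc$lower / fc$upper hold the
# prediction-interval bounds (80% and 95% by default).
plot(fc)
```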
We want to see whether the time series data has fluctuations around its trends, and to estimate the distributions of those fluctuations. (Knowing those distributions, further studies can be done.)
This can be done efficiently using the software monad QRMon, [AAp2, AA1]. Here we load the QRMon package:
#devtools::install_github(repo = "antononcube/QRMon-R")
library(QRMon)
Warning: replacing previous import ‘magrittr::set_names’ by ‘purrr::set_names’ when loading ‘QRMon’
Here we plot the consecutive differences of the cases and deaths:
dfQueryLongForm <-
  dfQueryLongForm %>%
  dplyr::group_by( Variable ) %>%
  dplyr::arrange( DateObject ) %>%
  dplyr::mutate( Difference = c(0, diff(Value)) ) %>%
  dplyr::ungroup()
ggplot(dfQueryLongForm) +
  geom_line( aes( x = DateObject, y = Difference ) ) +
  facet_wrap( ~Variable, ncol = 1, scales = "free_y" )
From the plots we see that time series are not monotonically increasing, and there are non-trivial fluctuations in the data.
Here we take the interesting part of the cases data:
dfQueryLongForm2 <-
  dfQueryLongForm %>%
  dplyr::filter( difftime( DateObject, as.POSIXct("2020-05-01") ) >= 0 ) %>%
  dplyr::mutate( Regressor = as.numeric(DateObject, origin = "1900-01-01") )
Here we specify a QRMon workflow that rescales the data, fits a B-spline curve to get the trend, and finds the absolute and relative errors (residuals, fluctuations) around that trend:
qrObj <-
  QRMonUnit( dfQueryLongForm2 %>% dplyr::filter( Variable == "Cases" ) %>% dplyr::select( Regressor, Value ) ) %>%
  QRMonRescale( regressorAxisQ = FALSE, valueAxisQ = TRUE ) %>%
  QRMonEchoDataSummary %>%
  QRMonQuantileRegression( df = 16, probabilities = 0.5 )
$Dimensions
[1] 552 2
$Summary
Regressor Value
Min. :1.588e+09 Min. :0.0000
1st Qu.:1.600e+09 1st Qu.:0.1227
Median :1.612e+09 Median :0.5582
Mean :1.612e+09 Mean :0.4606
3rd Qu.:1.624e+09 3rd Qu.:0.7180
Max. :1.636e+09 Max. :1.0000
Here we plot the fit:
qrObj <- qrObj %>% QRMonPlot(datePlotQ = T)
Here we plot absolute errors:
qrObj <- qrObj %>% QRMonErrorsPlot(relativeErrorsQ = F, datePlotQ = T)
Here is the summary:
summary( (qrObj %>% QRMonErrors(relativeErrorsQ = F) %>% QRMonTakeValue)[[1]]$Error )
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.0046392 -0.0007593 0.0000000 0.0001071 0.0007600 0.0082391
Here we plot relative errors:
qrObj <- qrObj %>% QRMonErrorsPlot(relativeErrorsQ = T, datePlotQ = T)
Here is the summary:
summary( (qrObj %>% QRMonErrors(relativeErrorsQ = T) %>% QRMonTakeValue)[[1]]$Error )
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.445265 -0.002165 0.000000 -0.005389 0.002182 0.169684
[NYT1] The New York Times, Coronavirus (Covid-19) Data in the United States, (2020), GitHub.
[WRI1] Wolfram Research Inc., USA county records, (2020), System Modeling at GitHub.
[JH1] CSSE at Johns Hopkins University, COVID-19, (2020), GitHub.
[VK1] Vitaliy Kaurov, Resources For Novel Coronavirus COVID-19, (2020), community.wolfram.com.
[AA1] Anton Antonov, “A monad for Quantile Regression workflows”, (2018), at MathematicaForPrediction WordPress.
[AAp1] Anton Antonov, Heatmap plot Mathematica package, (2018), MathematicaForPrediction at GitHub.
[AAp2] Anton Antonov, Monadic Quantile Regression Mathematica package, (2018), MathematicaForPrediction at GitHub.